1 MVP

We’ve looked at a few different ways in which we can build models this week, including how to prepare them properly. This weekend we’ll build a multiple linear regression model on a dataset which will need some preparation. The data can be found in the data folder, along with a data dictionary.

We want to investigate the avocado dataset, and, in particular, to model the AveragePrice of the avocados. Use the tools we’ve worked with this week to prepare your dataset and find appropriate predictors. Once you’ve built your model, use the validation techniques discussed on Wednesday to evaluate it. Feel free to focus on building either an explanatory or a predictive model, or both if you are feeling energetic!

As part of the MVP we want you not just to run the code but also to have a go at interpreting the results, writing your thinking in comments in your script.

Hints and tips

  • region may lead to many dummy variables. Think carefully about whether to include this variable or not (there is no one ‘right’ answer to this!)
  • Think about whether each variable is categorical or numerical. If categorical, make sure that the variable is represented as a factor.
  • We will not treat this data as a time series, so Date will not be needed in your models, but can you extract any useful features out of Date before you discard it?
  • If you want to build a predictive model, consider using either leaps or glmulti to help with this.
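To get a feel for the first hint, here is a minimal base R sketch of how a categorical variable expands into dummy variables. The toy data frame and its `region` values are made up for illustration and are not taken from the avocado data.

```r
# Hypothetical toy data: the prices and regions are invented for illustration.
toy <- data.frame(
  price  = c(1.1, 1.3, 0.9, 1.6, 1.2, 1.4),
  region = factor(c("north", "south", "east", "west", "north", "east"))
)

# model.matrix() shows the design matrix lm() would build:
# an intercept plus one dummy column per non-reference level.
design <- model.matrix(price ~ region, data = toy)
colnames(design)

# Number of dummy columns = number of levels minus the reference level.
n_dummies <- ncol(design) - 1
n_dummies
```

A factor with k levels adds k - 1 coefficients to the model, so a variable with dozens of levels makes the model much larger and harder to interpret.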

1.1 Researching and preparing our data

Here is what we found when looking for information on the ‘avocado’ data. We are treating this information as reliable.

“The table represents weekly retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.”

Relevant info for understanding ‘obscure’ variable names:

  • AveragePrice - the average price of a single avocado
  • Region - the city or region of the observation, i.e. where the avocados were sold
  • Total Volume - total number of avocados sold
  • 4046 - total number of small avocados sold (PLU 4046)
  • 4225 - total number of medium avocados sold (PLU 4225)
  • 4770 - total number of large avocados sold (PLU 4770)

According to the description above, the average price is recorded per avocado and does not depend on bag size, so we can drop the bag variables. Region also doesn’t seem to have a direct relation with average price, so it may be safe and beneficial to drop it too.

The x1 variable (an unnamed column in the raw file, renamed to week below) appears to record the week in which sales were recorded, on a 0 to 52 scale within each year. Although our brief does not call for time series analysis or forecasting, we can still investigate whether seasonality has an impact on average price. Avocados are very sensitive to variations in temperature, so weather patterns may affect production and, potentially, prices. We have decided to keep only the data for 2015 to 2017 and drop the partial 2018 data. This should help if seasons play a role in average price, since every season is then covered by complete years only.

So, we’ll focus on ‘average price’, ‘type’ and ‘total volume’. We’ll use ‘x1’, ‘date’ and ‘year’ to engineer variables which will enable us to explore seasonality.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggfortify)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(modelr)

1.1.1 Cleaning variable names and subsetting

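# Read the raw data and standardise the column names
avocado_df_exp <- read_csv("data/avocado.csv") %>% 
  clean_names() %>% 
  # keep week index, date, price, volume and sizes, plus type and year
  # (this drops the bag variables and region)
  select(x1:x4770, type:year) %>% 
  rename(week = x1,
         small = x4046,
         medium = x4225,
         large = x4770) %>% 
  # drop the partial 2018 data
  filter(date <= as.Date("2017-12-31"))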
## Warning: Missing column names filled in: 'X1' [1]
## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   Date = col_date(format = ""),
##   AveragePrice = col_double(),
##   `Total Volume` = col_double(),
##   `4046` = col_double(),
##   `4225` = col_double(),
##   `4770` = col_double(),
##   `Total Bags` = col_double(),
##   `Small Bags` = col_double(),
##   `Large Bags` = col_double(),
##   `XLarge Bags` = col_double(),
##   type = col_character(),
##   year = col_double(),
##   region = col_character()
## )
avocado_tidy <- avocado_df_exp %>%
  mutate(
    # month stays a character, so lm() will order its levels "1", "10", "11", ...
    month = as.character(month(date)),
    # meteorological seasons (northern hemisphere)
    season = case_when(
      month %in% c("12", "1", "2") ~ "winter",
      month %in% c("3", "4", "5") ~ "spring",
      month %in% c("6", "7", "8") ~ "summer",
      month %in% c("9", "10", "11") ~ "autumn"
    ),
    type = as.factor(type),
    season = as.factor(season),
    year = as.factor(year)
  ) %>% 
  select(-date)
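As a cross-check on the month-to-season mapping, the same feature can be sketched in base R with a lookup vector. This is an alternative sketch rather than the code used in this analysis; the example dates and the name `season_lookup` are made up for illustration.

```r
# Lookup vector: position i holds the season for month i (1 = January).
season_lookup <- c("winter", "winter", "spring", "spring", "spring",
                   "summer", "summer", "summer", "autumn", "autumn",
                   "autumn", "winter")

# Hypothetical example dates, not taken from the avocado data.
dates <- as.Date(c("2015-01-04", "2016-04-17", "2017-07-02", "2017-12-31"))

# Extract the month number, then index into the lookup vector.
month_num <- as.integer(format(dates, "%m"))
season <- factor(season_lookup[month_num])

data.frame(dates, month_num, season)
```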

We suspect ‘total volume’ is strongly correlated with the counts for each avocado size, so we check this with a pairs plot and, if it is, drop the size variables.

avocado_tidy %>% 
  select(total_volume:large) %>% 
  ggpairs()

avocado_tidy <- avocado_tidy %>% 
  select(-c(small, medium, large))
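The same check can be done numerically with a correlation matrix. The sketch below uses simulated data (not the avocado data) in which the three size counts share a common scale, which is our assumption about why they would track total volume so closely.

```r
set.seed(42)
n <- 200

# Simulated data: a common scale factor drives all three size counts.
scale_factor <- rlnorm(n, meanlog = 10, sdlog = 2)
small  <- scale_factor * runif(n, 0.30, 0.50)
medium <- scale_factor * runif(n, 0.30, 0.50)
large  <- scale_factor * runif(n, 0.01, 0.05)
total_volume <- small + medium + large

sizes <- data.frame(total_volume, small, medium, large)

# Correlation of each size with total volume, on the log scale
# because the simulated volumes are heavily right-skewed.
cor_with_total <- cor(log(sizes))["total_volume", -1]
round(cor_with_total, 2)
```

When predictors are this strongly correlated they carry nearly the same information, so keeping only one of them avoids unstable coefficient estimates.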

Let’s look at the summary statistics:

summary(avocado_tidy)
##       week       average_price   total_volume                type     
##  Min.   : 0.00   Min.   :0.44   Min.   :      85   conventional:8478  
##  1st Qu.:13.00   1st Qu.:1.10   1st Qu.:   10460   organic     :8475  
##  Median :26.00   Median :1.37   Median :  104849                      
##  Mean   :25.66   Mean   :1.41   Mean   :  834110                      
##  3rd Qu.:39.00   3rd Qu.:1.67   3rd Qu.:  423186                      
##  Max.   :52.00   Max.   :3.25   Max.   :61034457                      
##    year         month              season    
##  2015:5615   Length:16953       autumn:4212  
##  2016:5616   Class :character   spring:4320  
##  2017:5722   Mode  :character   summer:4210  
##                                 winter:4211  
##                                              
## 

Total volume is extremely right-skewed (its mean of 834,110 is roughly eight times its median of 104,849), so this will affect our models. We need to look into this.
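A log transformation is one common way to tame this kind of skew. The sketch below uses simulated log-normal data, not the avocado data, to show how the gap between mean and median shrinks after taking logs.

```r
set.seed(1)

# Simulated right-skewed volumes (an assumption for illustration only).
volume <- rlnorm(1000, meanlog = 11, sdlog = 2)

# On the raw scale the mean is pulled far above the median by large values.
c(mean = mean(volume), median = median(volume))

# On the log scale the mean and median are close together.
log_volume <- log(volume)
c(mean = mean(log_volume), median = median(log_volume))
```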

total_vol_by_type <- avocado_tidy %>% 
  group_by(type) %>% 
  summarise(avg_total_vol= mean(total_volume)) %>%
  mutate(pct = prop.table(avg_total_vol) * 100)
## `summarise()` ungrouping output (override with `.groups` argument)
total_vol_by_type

More than 97% of the avocados sold, measured by volume, are conventional. It makes sense to focus on this type when modelling average price.
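Note that this share is based on mean sales volume, not on the number of rows. Using the row counts from the summary table above, a quick sketch shows that the two types are almost equally represented as observations:

```r
# Row counts per type, copied from the summary(avocado_tidy) output above.
rows_by_type <- c(conventional = 8478, organic = 8475)

# Share of rows per type, in percent.
row_share <- prop.table(rows_by_type) * 100
round(row_share, 1)
```

So the organic subset is small in volume terms but has just as many observations available for modelling.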

avocado_tidy_conv <- avocado_tidy %>% 
  filter(type == "conventional") %>% 
  select(-type)
avocado_tidy_org <- avocado_tidy %>% 
  filter(type == "organic") %>% 
  select(-type)

1.2 Visualising our data

both_types <- ggplot(avocado_tidy) +
 aes(x = total_volume, y = average_price) +
 geom_point(size = 1L, colour = "#0c4c8a") +
 geom_smooth(span = 0.75) +
 scale_x_continuous(trans = "log") +
 scale_y_continuous(trans = "log") +
 labs(title = "Average price decreases when Total Volume increases") +
 theme_minimal()

conventional <- ggplot(avocado_tidy_conv) +
 aes(x = total_volume, y = average_price) +
 geom_point(size = 1L, colour = "#0c4c8a") +
 geom_smooth(span = 0.75) +
 scale_x_continuous(trans = "log") +
 scale_y_continuous(trans = "log") +
 labs(title = "Average price decreases when Total Volume increases") +
 theme_minimal()

organic <- ggplot(avocado_tidy_org) +
 aes(x = total_volume, y = average_price) +
 geom_point(size = 1L, colour = "#0c4c8a") +
 geom_smooth(span = 0.75) +
 scale_x_continuous(trans = "log") +
 scale_y_continuous(trans = "log") +
 labs(title = "Average price decreases when Total Volume increases") +
 theme_minimal()

both_types
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

conventional
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

organic
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(avocado_df_exp) +
 aes(x = date, y = average_price, colour = type) +
 geom_line(size = 1L) +
 scale_color_hue() +
 labs(title = "Average Price has a certain degree of seasonality") +
 theme_minimal() +
 facet_wrap(vars(type))

ggplot(avocado_df_exp) +
 aes(x = type, y = average_price, fill = type) +
 geom_boxplot() +
 scale_fill_hue() +
 labs(title = "As expected average price is higher for organic type") +
 theme_minimal()

ggplot(avocado_df_exp) +
 aes(x = date, weight = total_volume) +
 geom_bar(fill = "#0c4c8a") +
 labs(title = "Total Volume also shows a seasonal pattern") +
 theme_minimal()

1.3 Model development

1.3.1 First Predictor - type

avocado_tidy %>% 
   ggpairs(aes(colour = type, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

avocado_tidy %>% 
   ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.3.1.1 average price + type

mod_type <- lm(log(average_price) ~ type, data = avocado_tidy)
mod_type
## 
## Call:
## lm(formula = log(average_price) ~ type, data = avocado_tidy)
## 
## Coefficients:
## (Intercept)  typeorganic  
##      0.1222       0.3591
summary(mod_type)
## 
## Call:
## lm(formula = log(average_price) ~ type, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30232 -0.13775  0.00724  0.15524  0.69732 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.122230   0.002517   48.57   <2e-16 ***
## typeorganic 0.359108   0.003559  100.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2317 on 16951 degrees of freedom
## Multiple R-squared:  0.3752, Adjusted R-squared:  0.3752 
## F-statistic: 1.018e+04 on 1 and 16951 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type)
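Because this model uses log(average_price) as the response, its coefficients are on the log scale. A short sketch of how we might back-transform them, using the estimates printed in the summary above:

```r
# Coefficients copied from the summary(mod_type) output above.
intercept   <- 0.122230
typeorganic <- 0.359108

# exp(intercept) is the geometric-mean price of a conventional avocado.
conventional_price <- exp(intercept)

# exp(typeorganic) is the multiplicative effect of being organic.
organic_ratio <- exp(typeorganic)
organic_price <- exp(intercept + typeorganic)

# Percentage by which organic exceeds conventional.
pct_increase <- (organic_ratio - 1) * 100

round(c(conventional = conventional_price,
        organic      = organic_price,
        pct_increase = pct_increase), 2)
```

On this reading, organic avocados are about 43% more expensive than conventional ones, on a geometric-mean basis.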

1.3.1.2 average price + total volume

mod_total_volume <- lm(average_price ~ log(total_volume), data = avocado_tidy)
mod_total_volume
## 
## Call:
## lm(formula = average_price ~ log(total_volume), data = avocado_tidy)
## 
## Coefficients:
##       (Intercept)  log(total_volume)  
##            2.5746            -0.1032
summary(mod_total_volume)
## 
## Call:
## lm(formula = average_price ~ log(total_volume), data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06658 -0.23668 -0.03644  0.19778  1.67839 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.574617   0.012777  201.50   <2e-16 ***
## log(total_volume) -0.103156   0.001109  -92.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3327 on 16951 degrees of freedom
## Multiple R-squared:  0.3378, Adjusted R-squared:  0.3378 
## F-statistic:  8647 on 1 and 16951 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_total_volume)
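Here the predictor is logged but the response is not, so the slope describes a change in price for a proportional change in volume. A sketch of the arithmetic, using the slope from the summary above:

```r
# Slope on log(total_volume), copied from summary(mod_total_volume) above.
slope <- -0.103156

# Expected change in average price when total volume rises by 1%.
change_per_1pct <- slope * log(1.01)

# Expected change in average price when total volume doubles.
change_per_double <- slope * log(2)

round(c(per_1pct = change_per_1pct, per_doubling = change_per_double), 4)
```

In other words, doubling total volume is associated with a drop of roughly 0.07 in average price under this model.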

1.3.1.3 average price + month

mod_month <- lm(average_price ~ month, data = avocado_tidy)
mod_month
## 
## Call:
## lm(formula = average_price ~ month, data = avocado_tidy)
## 
## Coefficients:
## (Intercept)      month10      month11      month12       month2       month3  
##     1.28919      0.29050      0.16638      0.04193     -0.02957      0.04178  
##      month4       month5       month6       month7       month8       month9  
##     0.08519      0.05741      0.11978      0.17289      0.22333      0.28347
summary(mod_month)
## 
## Call:
## lm(formula = average_price ~ month, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99265 -0.30265 -0.03265  0.25444  1.79562 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.28919    0.01018 126.589  < 2e-16 ***
## month10      0.29050    0.01440  20.170  < 2e-16 ***
## month11      0.16638    0.01468  11.336  < 2e-16 ***
## month12      0.04193    0.01468   2.856  0.00429 ** 
## month2      -0.02957    0.01499  -1.973  0.04854 *  
## month3       0.04178    0.01468   2.846  0.00443 ** 
## month4       0.08519    0.01468   5.805 6.56e-09 ***
## month5       0.05741    0.01440   3.986 6.74e-05 ***
## month6       0.11978    0.01500   7.987 1.47e-15 ***
## month7       0.17289    0.01440  12.004  < 2e-16 ***
## month8       0.22333    0.01468  15.216  < 2e-16 ***
## month9       0.28347    0.01499  18.910  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.396 on 16941 degrees of freedom
## Multiple R-squared:  0.06224,    Adjusted R-squared:  0.06163 
## F-statistic: 102.2 on 11 and 16941 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_month)
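The coefficient order above (month10, month11, month12, month2, ...) comes from month being stored as a character, so its levels are sorted as text and month "1" (January) becomes the reference level. A minimal sketch of the effect, and of one possible fix using explicit levels:

```r
# Months stored as character, as in avocado_tidy.
month_chr <- as.character(1:12)

# Default factor levels are sorted as text, so "10" comes before "2".
default_levels <- levels(factor(month_chr))
default_levels

# Supplying the levels explicitly restores calendar order.
month_fct <- factor(month_chr, levels = as.character(1:12))
levels(month_fct)
```

The fitted values are the same either way; only the order of the coefficients in the output changes.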

1.3.1.4 average price + week

mod_week <- lm(average_price ~ week, data = avocado_tidy)
mod_week
## 
## Call:
## lm(formula = average_price ~ week, data = avocado_tidy)
## 
## Coefficients:
## (Intercept)         week  
##    1.521376    -0.004322
summary(mod_week)
## 
## Call:
## lm(formula = average_price ~ week, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03138 -0.30958 -0.03357  0.25855  1.80855 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.5213759  0.0061104  248.98   <2e-16 ***
## week        -0.0043223  0.0002052  -21.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4036 on 16951 degrees of freedom
## Multiple R-squared:  0.02551,    Adjusted R-squared:  0.02545 
## F-statistic: 443.8 on 1 and 16951 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_week)

1.3.1.5 average price + season

mod_season <- lm(average_price ~ season, data = avocado_tidy)
mod_season
## 
## Call:
## lm(formula = average_price ~ season, data = avocado_tidy)
## 
## Coefficients:
##  (Intercept)  seasonspring  seasonsummer  seasonwinter  
##      1.53615      -0.18560      -0.07357      -0.24209
summary(mod_season)
## 
## Call:
## lm(formula = average_price ~ season, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95615 -0.30405 -0.03257  0.25595  1.81945 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.536147   0.006130 250.602   <2e-16 ***
## seasonspring -0.185600   0.008615 -21.545   <2e-16 ***
## seasonsummer -0.073574   0.008670  -8.486   <2e-16 ***
## seasonwinter -0.242093   0.008669 -27.925   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3978 on 16949 degrees of freedom
## Multiple R-squared:  0.05314,    Adjusted R-squared:  0.05297 
## F-statistic: 317.1 on 3 and 16949 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_season)

1.3.1.6 average price + year

mod_year <- lm(average_price ~ year, data = avocado_tidy)
mod_year
## 
## Call:
## lm(formula = average_price ~ year, data = avocado_tidy)
## 
## Coefficients:
## (Intercept)     year2016     year2017  
##     1.37559     -0.03695      0.13954
summary(mod_year)
## 
## Call:
## lm(formula = average_price ~ year, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07513 -0.29864 -0.03864  0.25487  1.91136 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.375590   0.005360 256.632  < 2e-16 ***
## year2016    -0.036951   0.007580  -4.875  1.1e-06 ***
## year2017     0.139537   0.007545  18.494  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4017 on 16950 degrees of freedom
## Multiple R-squared:  0.03476,    Adjusted R-squared:  0.03465 
## F-statistic: 305.2 on 2 and 16950 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_year)
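To compare the single-predictor models side by side, we could collect their adjusted R-squared values in one vector. The sketch below uses the built-in mtcars data as a stand-in, so the model names and predictors are illustrative only. One caveat for our own models: mod_type was fit on log(average_price), so its R-squared is not measured on the same response as the others.

```r
# Stand-in candidate models on the built-in mtcars data.
candidates <- list(
  wt   = lm(mpg ~ wt,   data = mtcars),
  hp   = lm(mpg ~ hp,   data = mtcars),
  qsec = lm(mpg ~ qsec, data = mtcars)
)

# Extract adjusted R-squared from each model summary.
adj_r2 <- sapply(candidates, function(m) summary(m)$adj.r.squared)

# Rank the candidates from best to worst.
sort(adj_r2, decreasing = TRUE)

best <- names(which.max(adj_r2))
best
```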

1.3.2 Second Predictor - month

remaining_resid <- avocado_tidy %>% 
  add_residuals(mod_type) %>% 
  select(-c(average_price, type))
remaining_resid %>% 
  ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.3.2.1 avg_p + type + week

mod_type_week <- lm(average_price ~ type + week, data = avocado_tidy)
mod_type_week
## 
## Call:
## lm(formula = average_price ~ type + week, data = avocado_tidy)
## 
## Coefficients:
## (Intercept)  typeorganic         week  
##    1.271166     0.500254    -0.004317
summary(mod_type_week)
## 
## Call:
## lm(formula = average_price ~ type + week, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14577 -0.20821 -0.02663  0.19131  1.55832 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.271166   0.005377  236.40   <2e-16 ***
## typeorganic  0.500254   0.004865  102.83   <2e-16 ***
## week        -0.004317   0.000161  -26.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3167 on 16950 degrees of freedom
## Multiple R-squared:  0.3999, Adjusted R-squared:  0.3998 
## F-statistic:  5648 on 2 and 16950 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_week)

1.3.2.2 avg_p + type + total volume

mod_type_total_volume <- lm(average_price ~ type + log(total_volume), data = avocado_tidy)
mod_type_total_volume
## 
## Call:
## lm(formula = average_price ~ type + log(total_volume), data = avocado_tidy)
## 
## Coefficients:
##       (Intercept)        typeorganic  log(total_volume)  
##           1.75552            0.33349           -0.04535
summary(mod_type_total_volume)
## 
## Call:
## lm(formula = average_price ~ type + log(total_volume), data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14711 -0.20784 -0.02665  0.18296  1.60193 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.755519   0.023313   75.30   <2e-16 ***
## typeorganic        0.333487   0.008093   41.21   <2e-16 ***
## log(total_volume) -0.045349   0.001757  -25.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3172 on 16950 degrees of freedom
## Multiple R-squared:  0.3981, Adjusted R-squared:  0.398 
## F-statistic:  5606 on 2 and 16950 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_total_volume)

1.3.2.3 avg_p + type + year

mod_type_year <- lm(average_price ~ type + year, data = avocado_tidy)
mod_type_year
## 
## Call:
## lm(formula = average_price ~ type + year, data = avocado_tidy)
## 
## Coefficients:
## (Intercept)  typeorganic     year2016     year2017  
##      1.1255       0.5003      -0.0370       0.1396
summary(mod_type_year)
## 
## Call:
## lm(formula = average_price ~ type + year, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32537 -0.18880 -0.01548  0.18463  1.66120 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.125478   0.004838 232.624  < 2e-16 ***
## typeorganic  0.500313   0.004827 103.653  < 2e-16 ***
## year2016    -0.036995   0.005930  -6.238 4.53e-10 ***
## year2017     0.139580   0.005903  23.647  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3142 on 16949 degrees of freedom
## Multiple R-squared:  0.4092, Adjusted R-squared:  0.4091 
## F-statistic:  3914 on 3 and 16949 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_year)

1.3.2.4 avg_p + type + month

mod_type_month <- lm(average_price ~ type + month, data = avocado_tidy)
mod_type_month
## 
## Call:
## lm(formula = average_price ~ type + month, data = avocado_tidy)
## 
## Coefficients:
## (Intercept)  typeorganic      month10      month11      month12       month2  
##     1.03904      0.50028      0.29050      0.16638      0.04210     -0.02957  
##      month3       month4       month5       month6       month7       month8  
##     0.04178      0.08519      0.05741      0.12016      0.17289      0.22333  
##      month9  
##     0.28347
summary(mod_type_month)
## 
## Call:
## lm(formula = average_price ~ type + month, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14982 -0.20115 -0.02194  0.19046  1.54548 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.039045   0.008238 126.129  < 2e-16 ***
## typeorganic  0.500283   0.004715 106.112  < 2e-16 ***
## month10      0.290496   0.011163  26.023  < 2e-16 ***
## month11      0.166376   0.011376  14.626  < 2e-16 ***
## month12      0.042104   0.011378   3.701 0.000216 ***
## month2      -0.029572   0.011619  -2.545 0.010930 *  
## month3       0.041775   0.011376   3.672 0.000241 ***
## month4       0.085194   0.011376   7.489 7.27e-14 ***
## month5       0.057414   0.011163   5.143 2.73e-07 ***
## month6       0.120165   0.011624  10.338  < 2e-16 ***
## month7       0.172890   0.011163  15.488  < 2e-16 ***
## month8       0.223328   0.011376  19.632  < 2e-16 ***
## month9       0.283468   0.011619  24.397  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3069 on 16940 degrees of freedom
## Multiple R-squared:  0.4367, Adjusted R-squared:  0.4363 
## F-statistic:  1094 on 12 and 16940 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_month)

1.3.2.4.1 running anova() to confirm ‘month’
# mod_type was fit on log(average_price), so anova() cannot compare it with
# mod_type_month; refit the type-only model on the untransformed response first
mod_type_raw <- lm(average_price ~ type, data = avocado_tidy)
anova(mod_type_raw, mod_type_month)
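anova() can only compare nested models that share the same response variable. A minimal sketch of a valid comparison, using the built-in mtcars data as a stand-in for our own models:

```r
# Two nested models with the same response (mpg).
small_model <- lm(mpg ~ wt,      data = mtcars)
large_model <- lm(mpg ~ wt + hp, data = mtcars)

# F-test: does adding hp significantly reduce the residual sum of squares?
comparison <- anova(small_model, large_model)
comparison

# The p-value for the added term sits in the second row.
p_value <- comparison[["Pr(>F)"]][2]
p_value
```

A small p-value here suggests the larger model explains significantly more variation than the smaller one.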

1.3.2.5 avg_p + type + season

mod_type_season <- lm(average_price ~ type + season, data = avocado_tidy)
mod_type_season
## 
## Call:
## lm(formula = average_price ~ type + season, data = avocado_tidy)
## 
## Coefficients:
##  (Intercept)   typeorganic  seasonspring  seasonsummer  seasonwinter  
##      1.28600       0.50029      -0.18560      -0.07346      -0.24203
summary(mod_type_season)
## 
## Call:
## lm(formula = average_price ~ type + season, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16069 -0.20284 -0.02284  0.18931  1.56931 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.286001   0.005325  241.48   <2e-16 ***
## typeorganic   0.500291   0.004751  105.29   <2e-16 ***
## seasonspring -0.185600   0.006698  -27.71   <2e-16 ***
## seasonsummer -0.073455   0.006741  -10.90   <2e-16 ***
## seasonwinter -0.242034   0.006741  -35.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3093 on 16948 degrees of freedom
## Multiple R-squared:  0.4276, Adjusted R-squared:  0.4274 
## F-statistic:  3165 on 4 and 16948 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_season)

1.3.3 Third Predictor - year

# use residuals from the current best model (type + month), not mod_month alone
remaining_resid <- avocado_tidy %>% 
  add_residuals(mod_type_month) %>% 
  select(-c(average_price, type, month))
remaining_resid %>% 
  ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.3.3.1 avg_p + type + month + total volume

mod_type_month_total_volume <- lm(average_price ~ type + month + log(total_volume), data = avocado_tidy)
mod_type_month_total_volume
## 
## Call:
## lm(formula = average_price ~ type + month + log(total_volume), 
##     data = avocado_tidy)
## 
## Coefficients:
##       (Intercept)        typeorganic            month10            month11  
##           1.60846            0.34000            0.28755            0.16171  
##           month12             month2             month3             month4  
##           0.04255           -0.02532            0.04463            0.09181  
##            month5             month6             month7             month8  
##           0.06661            0.12723            0.17821            0.22518  
##            month9  log(total_volume)  
##           0.28350           -0.04358
summary(mod_type_month_total_volume)
## 
## Call:
## lm(formula = average_price ~ type + month + log(total_volume), 
##     data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.13248 -0.19748 -0.01783  0.17872  1.47888 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.608457   0.023279  69.096  < 2e-16 ***
## typeorganic        0.340002   0.007690  44.213  < 2e-16 ***
## month10            0.287554   0.010946  26.269  < 2e-16 ***
## month11            0.161713   0.011156  14.496  < 2e-16 ***
## month12            0.042553   0.011156   3.814 0.000137 ***
## month2            -0.025317   0.011394  -2.222 0.026299 *  
## month3             0.044628   0.011155   4.001 6.34e-05 ***
## month4             0.091807   0.011157   8.228  < 2e-16 ***
## month5             0.066609   0.010951   6.082 1.21e-09 ***
## month6             0.127226   0.011401  11.160  < 2e-16 ***
## month7             0.178207   0.010948  16.278  < 2e-16 ***
## month8             0.225184   0.011154  20.188  < 2e-16 ***
## month9             0.283504   0.011393  24.885  < 2e-16 ***
## log(total_volume) -0.043575   0.001671 -26.081  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.301 on 16939 degrees of freedom
## Multiple R-squared:  0.4584, Adjusted R-squared:  0.458 
## F-statistic:  1103 on 13 and 16939 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_month_total_volume)

1.3.3.2 avg_p + type + month + season

mod_type_month_season <- lm(average_price ~ type + month + season, data = avocado_tidy)
mod_type_month_season
## 
## Call:
## lm(formula = average_price ~ type + month + season, data = avocado_tidy)
## 
## Coefficients:
##  (Intercept)   typeorganic       month10       month11       month12  
##      1.03904       0.50028       0.29050       0.16638       0.04210  
##       month2        month3        month4        month5        month6  
##     -0.02957       0.04178       0.08519       0.05741       0.12016  
##       month7        month8        month9  seasonspring  seasonsummer  
##      0.17289       0.22333       0.28347            NA            NA  
## seasonwinter  
##           NA
summary(mod_type_month_season)
## 
## Call:
## lm(formula = average_price ~ type + month + season, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14982 -0.20115 -0.02194  0.19046  1.54548 
## 
## Coefficients: (3 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.039045   0.008238 126.129  < 2e-16 ***
## typeorganic   0.500283   0.004715 106.112  < 2e-16 ***
## month10       0.290496   0.011163  26.023  < 2e-16 ***
## month11       0.166376   0.011376  14.626  < 2e-16 ***
## month12       0.042104   0.011378   3.701 0.000216 ***
## month2       -0.029572   0.011619  -2.545 0.010930 *  
## month3        0.041775   0.011376   3.672 0.000241 ***
## month4        0.085194   0.011376   7.489 7.27e-14 ***
## month5        0.057414   0.011163   5.143 2.73e-07 ***
## month6        0.120165   0.011624  10.338  < 2e-16 ***
## month7        0.172890   0.011163  15.488  < 2e-16 ***
## month8        0.223328   0.011376  19.632  < 2e-16 ***
## month9        0.283468   0.011619  24.397  < 2e-16 ***
## seasonspring        NA         NA      NA       NA    
## seasonsummer        NA         NA      NA       NA    
## seasonwinter        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3069 on 16940 degrees of freedom
## Multiple R-squared:  0.4367, Adjusted R-squared:  0.4363 
## F-statistic:  1094 on 12 and 16940 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_month_season)
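The NA coefficients arise because season is fully determined by month, so the season dummies are linear combinations of the month dummies and add no new information. A small sketch with simulated prices reproduces the effect; the data here are made up for illustration.

```r
set.seed(7)

# Five simulated observations per month.
month <- factor(rep(1:12, each = 5))

# Season derived from month, as in avocado_tidy.
season_lookup <- c("winter", "winter", "spring", "spring", "spring",
                   "summer", "summer", "summer", "autumn", "autumn",
                   "autumn", "winter")
season <- factor(season_lookup[as.integer(month)])

# Simulated response (an assumption, unrelated to the real prices).
price <- rnorm(length(month), mean = 1.4, sd = 0.3)

# lm() cannot estimate the season terms once month is in the model.
fit <- lm(price ~ month + season)
n_aliased <- sum(is.na(coef(fit)))
n_aliased
```

Since season adds nothing beyond month, only one of the two should be kept in a model.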

1.3.3.3 avg_p + type + month + year

mod_type_month_year <- lm(average_price ~ type + month + year, data = avocado_tidy)
mod_type_month_year
## 
## Call:
## lm(formula = average_price ~ type + month + year, data = avocado_tidy)
## 
## Coefficients:
## (Intercept)  typeorganic      month10      month11      month12       month2  
##     1.00211      0.50030      0.29050      0.17149      0.03636     -0.02711  
##      month3       month4       month5       month6       month7       month8  
##     0.04689      0.07948      0.06746      0.12279      0.17289      0.22844  
##      month9     year2016     year2017  
##     0.28593     -0.03730      0.14070
summary(mod_type_month_year)
## 
## Call:
## lm(formula = average_price ~ type + month + year, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25000 -0.18917 -0.01311  0.18043  1.49439 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.002107   0.008701 115.174  < 2e-16 ***
## typeorganic  0.500303   0.004565 109.594  < 2e-16 ***
## month10      0.290496   0.010809  26.876  < 2e-16 ***
## month11      0.171489   0.011025  15.554  < 2e-16 ***
## month12      0.036364   0.011019   3.300 0.000969 ***
## month2      -0.027110   0.011253  -2.409 0.015996 *  
## month3       0.046888   0.011025   4.253 2.12e-05 ***
## month4       0.079484   0.011017   7.214 5.65e-13 ***
## month5       0.067464   0.010816   6.237 4.56e-10 ***
## month6       0.122791   0.011257  10.908  < 2e-16 ***
## month7       0.172890   0.010809  15.995  < 2e-16 ***
## month8       0.228441   0.011025  20.720  < 2e-16 ***
## month9       0.285930   0.011253  25.410  < 2e-16 ***
## year2016    -0.037298   0.005621  -6.636 3.32e-11 ***
## year2017     0.140698   0.005600  25.122  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2972 on 16938 degrees of freedom
## Multiple R-squared:  0.4719, Adjusted R-squared:  0.4715 
## F-statistic:  1081 on 14 and 16938 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_month_year)
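
Adding year lifts the adjusted R-squared from 0.4363 to 0.4715. One way to check whether a whole factor, rather than its individual dummy coefficients, improves on the smaller model is a nested F-test with anova(). The sketch below uses the built-in mtcars data so that it runs on its own; cars_fct, mod_small and mod_big are illustrative names, not objects from this analysis.

```r
# Treat cyl and gear as categorical, as type/month/year are treated above.
cars_fct <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

mod_small <- lm(mpg ~ wt + cyl, data = cars_fct)
mod_big   <- lm(mpg ~ wt + cyl + gear, data = cars_fct)

# One F-test for all dummy variables of the added factor at once.
comparison <- anova(mod_small, mod_big)
print(comparison)
```

In the session above, the analogous call would pass the model fitted without year and mod_type_month_year to anova().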

1.3.4 Fourth Predictor - total volume

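# Residuals from the current best model (type + month + year), paired with
# the predictors that have not been used yet
remaining_resid <- avocado_tidy %>% 
  add_residuals(mod_type_month_year) %>% 
  select(-c(average_price, type, month, year))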
remaining_resid %>% 
  ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.3.4.1 avg_p + type + month + year + total volume

mod_type_month_year_total_volume <- lm(average_price ~ type + month + year + total_volume, data = avocado_tidy)
mod_type_month_year_total_volume
## 
## Call:
## lm(formula = average_price ~ type + month + year + total_volume, 
##     data = avocado_tidy)
## 
## Coefficients:
##  (Intercept)   typeorganic       month10       month11       month12  
##    1.011e+00     4.912e-01     2.894e-01     1.704e-01     3.578e-02  
##       month2        month3        month4        month5        month6  
##   -2.653e-02     4.667e-02     7.951e-02     6.805e-02     1.231e-01  
##       month7        month8        month9      year2016      year2017  
##    1.728e-01     2.281e-01     2.852e-01    -3.686e-02     1.412e-01  
## total_volume  
##   -5.769e-09
summary(mod_type_month_year_total_volume)
## 
## Call:
## lm(formula = average_price ~ type + month + year + total_volume, 
##     data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25008 -0.18825 -0.01059  0.17964  1.49500 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.011e+00  8.755e-03 115.526  < 2e-16 ***
## typeorganic   4.912e-01  4.685e-03 104.844  < 2e-16 ***
## month10       2.894e-01  1.079e-02  26.822  < 2e-16 ***
## month11       1.704e-01  1.100e-02  15.485  < 2e-16 ***
## month12       3.578e-02  1.100e-02   3.253  0.00114 ** 
## month2       -2.653e-02  1.123e-02  -2.362  0.01817 *  
## month3        4.667e-02  1.100e-02   4.242 2.23e-05 ***
## month4        7.951e-02  1.100e-02   7.231 4.99e-13 ***
## month5        6.805e-02  1.079e-02   6.304 2.98e-10 ***
## month6        1.231e-01  1.123e-02  10.957  < 2e-16 ***
## month7        1.728e-01  1.079e-02  16.018  < 2e-16 ***
## month8        2.281e-01  1.100e-02  20.727  < 2e-16 ***
## month9        2.852e-01  1.123e-02  25.399  < 2e-16 ***
## year2016     -3.686e-02  5.610e-03  -6.571 5.15e-11 ***
## year2017      1.412e-01  5.590e-03  25.257  < 2e-16 ***
## total_volume -5.769e-09  6.932e-10  -8.323  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2966 on 16937 degrees of freedom
## Multiple R-squared:  0.4741, Adjusted R-squared:  0.4736 
## F-statistic:  1018 on 15 and 16937 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_month_year_total_volume)
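
The total_volume coefficient of -5.769e-09 looks negligible only because it is expressed per single unit sold; per million units it is about -0.0058 in price units. The sketch below, on synthetic data with made-up coefficients, illustrates that rescaling a predictor rescales its slope without changing the fit. The column total_volume_millions is hypothetical and does not exist in avocado_tidy.

```r
set.seed(42)
n <- 200
toy <- data.frame(total_volume = runif(n, 1e4, 5e7))
toy$average_price <- 1.4 - 6e-09 * toy$total_volume + rnorm(n, sd = 0.05)

mod_units <- lm(average_price ~ total_volume, data = toy)

# Same predictor expressed in millions of units (hypothetical helper column).
toy$total_volume_millions <- toy$total_volume / 1e6
mod_millions <- lm(average_price ~ total_volume_millions, data = toy)

slope_units    <- coef(mod_units)[["total_volume"]]
slope_millions <- coef(mod_millions)[["total_volume_millions"]]

# The slope is multiplied by 1e6; R-squared is unchanged.
print(c(per_unit = slope_units, per_million = slope_millions))
print(c(summary(mod_units)$r.squared, summary(mod_millions)$r.squared))
```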

1.3.4.2 avg_p + type + month + year + week

mod_type_month_year_week <- lm(average_price ~ type + month + year + week, data = avocado_tidy)
mod_type_month_year_week
## 
## Call:
## lm(formula = average_price ~ type + month + year + week, data = avocado_tidy)
## 
## Coefficients:
## (Intercept)  typeorganic      month10      month11      month12       month2  
##    1.398290     0.500213    -0.022093    -0.177098    -0.347067    -0.061808  
##      month3       month4       month5       month6       month7       month8  
##   -0.021168    -0.023442    -0.071366    -0.050825    -0.035485    -0.015932  
##      month9     year2016     year2017         week  
##    0.008115    -0.039921     0.145008    -0.008016
summary(mod_type_month_year_week)
## 
## Call:
## lm(formula = average_price ~ type + month + year + week, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23763 -0.18814 -0.01214  0.18029  1.49463 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.398290   0.089820  15.568  < 2e-16 ***
## typeorganic  0.500213   0.004563 109.633  < 2e-16 ***
## month10     -0.022093   0.071357  -0.310   0.7569    
## month11     -0.177098   0.079426  -2.230   0.0258 *  
## month12     -0.347067   0.087218  -3.979 6.94e-05 ***
## month2      -0.061808   0.013703  -4.510 6.51e-06 ***
## month3      -0.021168   0.018901  -1.120   0.2628    
## month4      -0.023442   0.025703  -0.912   0.3618    
## month5      -0.071366   0.033139  -2.154   0.0313 *  
## month6      -0.050825   0.040759  -1.247   0.2124    
## month7      -0.035485   0.048244  -0.736   0.4620    
## month8      -0.015932   0.056232  -0.283   0.7769    
## month9       0.008115   0.063689   0.127   0.8986    
## year2016    -0.039921   0.005649  -7.067 1.64e-12 ***
## year2017     0.145008   0.005681  25.524  < 2e-16 ***
## week        -0.008016   0.001809  -4.432 9.41e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.297 on 16937 degrees of freedom
## Multiple R-squared:  0.4725, Adjusted R-squared:  0.4721 
## F-statistic:  1012 on 15 and 16937 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_month_year_week)
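
With week in the model, most month coefficients lose significance and their standard errors grow (month10 goes from 0.0108 to 0.0714), which is the usual sign of collinearity: week number and month carry almost the same information. A rough way to quantify this is the variance inflation factor for week, 1 / (1 - R²), where R² comes from regressing week on the other predictors. The sketch below assumes a hypothetical week-to-month mapping rather than the Date-derived columns in avocado_tidy.

```r
weeks <- 1:52

# Assumed mapping: 52 weeks split evenly across 12 months.
month <- factor(pmin(12, ceiling(weeks / (52 / 12))))

# Auxiliary regression: how well does month alone explain week?
aux_model <- lm(weeks ~ month)
aux_r2 <- summary(aux_model)$r.squared

# Variance inflation factor for the week coefficient.
vif_week <- 1 / (1 - aux_r2)
print(c(aux_r2 = aux_r2, vif_week = vif_week))
```

A common rule of thumb treats values above 5 or 10 as problematic, which suggests keeping month or week, but not both.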

1.3.4.3 avg_p + type + month + year + season

mod_type_month_year_season <- lm(average_price ~ type + month + year + season, data = avocado_tidy)
mod_type_month_year_season
## 
## Call:
## lm(formula = average_price ~ type + month + year + season, data = avocado_tidy)
## 
## Coefficients:
##  (Intercept)   typeorganic       month10       month11       month12  
##      1.00211       0.50030       0.29050       0.17149       0.03636  
##       month2        month3        month4        month5        month6  
##     -0.02711       0.04689       0.07948       0.06746       0.12279  
##       month7        month8        month9      year2016      year2017  
##      0.17289       0.22844       0.28593      -0.03730       0.14070  
## seasonspring  seasonsummer  seasonwinter  
##           NA            NA            NA
summary(mod_type_month_year_season)
## 
## Call:
## lm(formula = average_price ~ type + month + year + season, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25000 -0.18917 -0.01311  0.18043  1.49439 
## 
## Coefficients: (3 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.002107   0.008701 115.174  < 2e-16 ***
## typeorganic   0.500303   0.004565 109.594  < 2e-16 ***
## month10       0.290496   0.010809  26.876  < 2e-16 ***
## month11       0.171489   0.011025  15.554  < 2e-16 ***
## month12       0.036364   0.011019   3.300 0.000969 ***
## month2       -0.027110   0.011253  -2.409 0.015996 *  
## month3        0.046888   0.011025   4.253 2.12e-05 ***
## month4        0.079484   0.011017   7.214 5.65e-13 ***
## month5        0.067464   0.010816   6.237 4.56e-10 ***
## month6        0.122791   0.011257  10.908  < 2e-16 ***
## month7        0.172890   0.010809  15.995  < 2e-16 ***
## month8        0.228441   0.011025  20.720  < 2e-16 ***
## month9        0.285930   0.011253  25.410  < 2e-16 ***
## year2016     -0.037298   0.005621  -6.636 3.32e-11 ***
## year2017      0.140698   0.005600  25.122  < 2e-16 ***
## seasonspring        NA         NA      NA       NA    
## seasonsummer        NA         NA      NA       NA    
## seasonwinter        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2972 on 16938 degrees of freedom
## Multiple R-squared:  0.4719, Adjusted R-squared:  0.4715 
## F-statistic:  1081 on 14 and 16938 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_type_month_year_season)
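
Of the three candidates for a fourth predictor, total_volume gives the highest adjusted R-squared (0.4736), ahead of week (0.4721) and season (0.4715, unchanged because season is aliased with month). Rather than reading these values off separate summaries, the fit statistics can be collected into one table. The sketch below does this on the built-in mtcars data; candidates and fit_table are illustrative names.

```r
cars_fct <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Named list of candidate models that extend the same base model.
candidates <- list(
  base     = lm(mpg ~ wt, data = cars_fct),
  plus_cyl = lm(mpg ~ wt + cyl, data = cars_fct),
  plus_am  = lm(mpg ~ wt + am, data = cars_fct)
)

# One row per model with adjusted R-squared and residual standard error.
fit_table <- data.frame(
  model         = names(candidates),
  adj_r_squared = sapply(candidates, function(m) summary(m)$adj.r.squared),
  resid_se      = sapply(candidates, function(m) summary(m)$sigma),
  row.names     = NULL
)

# Best candidate first.
print(fit_table[order(-fit_table$adj_r_squared), ])
```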

1.3.5 Interactions

average_price_residual <- avocado_tidy %>% 
  add_residuals(mod_type_month_year_total_volume) %>% 
  select(-average_price)
coplot(resid ~ log(total_volume) | month,
       panel = function(x, y, ...){
         points(x, y)
         abline(lm(y ~ x), col = "blue")
       },
       data = average_price_residual, columns = 6)

average_price_residual %>%
  ggplot(aes(x = log(total_volume), y = resid, colour = type)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

1.3.5.1 type - month

mod_int_t1 <- lm(average_price ~ type + month + year + total_volume + type:month, data = avocado_tidy)
summary(mod_int_t1)
## 
## Call:
## lm(formula = average_price ~ type + month + year + total_volume + 
##     type:month, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20608 -0.18751 -0.01144  0.17733  1.51395 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.010e+00  1.137e-02  88.794  < 2e-16 ***
## typeorganic          4.948e-01  1.527e-02  32.399  < 2e-16 ***
## month10              3.101e-01  1.523e-02  20.360  < 2e-16 ***
## month11              1.720e-01  1.553e-02  11.076  < 2e-16 ***
## month12              3.353e-02  1.552e-02   2.160   0.0308 *  
## month2              -3.410e-02  1.585e-02  -2.151   0.0315 *  
## month3               9.247e-02  1.553e-02   5.956 2.64e-09 ***
## month4               9.964e-02  1.552e-02   6.420 1.40e-10 ***
## month5               6.373e-02  1.523e-02   4.183 2.89e-05 ***
## month6               1.153e-01  1.585e-02   7.271 3.73e-13 ***
## month7               1.753e-01  1.523e-02  11.510  < 2e-16 ***
## month8               2.027e-01  1.553e-02  13.057  < 2e-16 ***
## month9               2.588e-01  1.585e-02  16.326  < 2e-16 ***
## year2016            -3.686e-02  5.600e-03  -6.582 4.76e-11 ***
## year2017             1.412e-01  5.580e-03  25.301  < 2e-16 ***
## total_volume        -5.749e-09  6.923e-10  -8.305  < 2e-16 ***
## typeorganic:month10 -4.151e-02  2.154e-02  -1.927   0.0540 .  
## typeorganic:month11 -3.191e-03  2.195e-02  -0.145   0.8844    
## typeorganic:month12  4.502e-03  2.195e-02   0.205   0.8375    
## typeorganic:month2   1.513e-02  2.242e-02   0.675   0.4997    
## typeorganic:month3  -9.159e-02  2.195e-02  -4.173 3.02e-05 ***
## typeorganic:month4  -4.027e-02  2.195e-02  -1.835   0.0665 .  
## typeorganic:month5   8.631e-03  2.154e-02   0.401   0.6886    
## typeorganic:month6   1.572e-02  2.243e-02   0.701   0.4834    
## typeorganic:month7  -5.002e-03  2.154e-02  -0.232   0.8164    
## typeorganic:month8   5.065e-02  2.195e-02   2.308   0.0210 *  
## typeorganic:month9   5.282e-02  2.242e-02   2.356   0.0185 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2961 on 16926 degrees of freedom
## Multiple R-squared:  0.4762, Adjusted R-squared:  0.4754 
## F-statistic: 591.9 on 26 and 16926 DF,  p-value: < 2.2e-16
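
With the type:month interaction in the model, typeorganic on its own is the organic premium in the reference month only (month 1, the level absent from the coefficient table). For any other month the premium is typeorganic plus the matching interaction term. The arithmetic below uses the estimates printed above; the variable names are illustrative.

```r
# Estimates copied from the summary(mod_int_t1) output.
coefs <- c(
  "typeorganic"        =  4.948e-01,
  "typeorganic:month3" = -9.159e-02,
  "typeorganic:month9" =  5.282e-02
)

# Organic premium = main effect + interaction for that month.
premium_reference <- coefs[["typeorganic"]]
premium_month3    <- coefs[["typeorganic"]] + coefs[["typeorganic:month3"]]
premium_month9    <- coefs[["typeorganic"]] + coefs[["typeorganic:month9"]]

print(round(c(reference = premium_reference,
              month3    = premium_month3,
              month9    = premium_month9), 4))
```

So the estimated premium ranges from roughly 0.40 in month 3 to roughly 0.55 in month 9, a modest shift compared with the main effect of about 0.49.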

1.3.5.2 month - year

mod_int_t2 <- lm(average_price ~ type + month + year + total_volume + month:year, data = avocado_tidy)
summary(mod_int_t2)
## 
## Call:
## lm(formula = average_price ~ type + month + year + total_volume + 
##     month:year, data = avocado_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20897 -0.17156 -0.01063  0.16921  1.44350 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.123e+00  1.400e-02  80.230  < 2e-16 ***
## typeorganic       4.917e-01  4.527e-03 108.630  < 2e-16 ***
## month10           2.677e-02  1.950e-02   1.373 0.169853    
## month11          -3.472e-02  1.850e-02  -1.877 0.060550 .  
## month12          -5.950e-02  1.951e-02  -3.050 0.002294 ** 
## month2           -3.754e-02  1.950e-02  -1.925 0.054207 .  
## month3           -2.854e-03  1.850e-02  -0.154 0.877401    
## month4            1.873e-02  1.950e-02   0.961 0.336758    
## month5           -1.949e-02  1.850e-02  -1.054 0.291990    
## month6            3.483e-02  1.950e-02   1.786 0.074076 .  
## month7            4.488e-02  1.950e-02   2.302 0.021353 *  
## month8            7.965e-02  1.850e-02   4.306 1.67e-05 ***
## month9            8.424e-02  1.950e-02   4.320 1.57e-05 ***
## year2016         -1.241e-01  1.850e-02  -6.708 2.04e-11 ***
## year2017         -8.618e-02  1.850e-02  -4.659 3.21e-06 ***
## total_volume     -5.437e-09  6.698e-10  -8.116 5.13e-16 ***
## month10:year2016  2.890e-01  2.616e-02  11.047  < 2e-16 ***
## month11:year2016  3.430e-01  2.616e-02  13.113  < 2e-16 ***
## month12:year2016  1.347e-01  2.689e-02   5.010 5.50e-07 ***
## month2:year2016   3.507e-02  2.688e-02   1.305 0.191959    
## month3:year2016  -1.298e-02  2.616e-02  -0.496 0.619735    
## month4:year2016  -5.362e-02  2.688e-02  -1.995 0.046049 *  
## month5:year2016  -2.011e-02  2.542e-02  -0.791 0.429052    
## month6:year2016   8.418e-03  2.688e-02   0.313 0.754129    
## month7:year2016   1.162e-01  2.616e-02   4.441 9.00e-06 ***
## month8:year2016   9.115e-02  2.616e-02   3.484 0.000494 ***
## month9:year2016   1.032e-01  2.688e-02   3.840 0.000123 ***
## month10:year2017  4.465e-01  2.616e-02  17.066  < 2e-16 ***
## month11:year2017  2.732e-01  2.616e-02  10.444  < 2e-16 ***
## month12:year2017  1.451e-01  2.617e-02   5.545 2.98e-08 ***
## month2:year2017  -2.460e-02  2.688e-02  -0.915 0.359991    
## month3:year2017   1.234e-01  2.616e-02   4.718 2.40e-06 ***
## month4:year2017   2.059e-01  2.616e-02   7.872 3.69e-15 ***
## month5:year2017   2.746e-01  2.616e-02  10.496  < 2e-16 ***
## month6:year2017   2.340e-01  2.689e-02   8.702  < 2e-16 ***
## month7:year2017   2.420e-01  2.616e-02   9.249  < 2e-16 ***
## month8:year2017   3.407e-01  2.616e-02  13.023  < 2e-16 ***
## month9:year2017   4.774e-01  2.688e-02  17.763  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2866 on 16915 degrees of freedom
## Multiple R-squared:  0.5097, Adjusted R-squared:  0.5086 
## F-statistic: 475.2 on 37 and 16915 DF,  p-value: < 2.2e-16
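
Both interaction models nest the main-effects model (R-squared 0.4741), so each interaction can be judged as a whole with a partial F-test. The sketch below rebuilds that test from the rounded R-squared values and residual degrees of freedom printed above, so the result is approximate. partial_f is a hypothetical helper, not a function from any package.

```r
# Partial F-test computed from the R-squared of two nested models.
partial_f <- function(r2_small, r2_big, extra_terms, resid_df_big) {
  f_stat  <- ((r2_big - r2_small) / extra_terms) / ((1 - r2_big) / resid_df_big)
  p_value <- pf(f_stat, extra_terms, resid_df_big, lower.tail = FALSE)
  c(F = f_stat, p = p_value)
}

# type:month adds 11 coefficients; month:year adds 22.
type_month <- partial_f(0.4741, 0.4762, extra_terms = 11, resid_df_big = 16926)
month_year <- partial_f(0.4741, 0.5097, extra_terms = 22, resid_df_big = 16915)

print(rbind(type_month, month_year))
```

In the live session, anova(mod_type_month_year_total_volume, mod_int_t2) would give the exact version of the same comparison. On these figures month:year is by far the stronger interaction, at the cost of 22 extra coefficients.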
# Relative importance (lmg) of each predictor group in the main-effects model
relaimpo::calc.relimp(mod_type_month_year_total_volume, type = "lmg", rela = TRUE)
## Response variable: average_price 
## Total response variance: 0.1671173 
## Analysis based on 16953 observations 
## 
## 15 Regressors: 
## Some regressors combined in groups: 
##         Group  month : month10 month11 month12 month2 month3 month4 month5 month6 month7 month8 month9 
##         Group  year : year2016 year2017 
## 
##  Relative importance of 4 (groups of) regressors assessed: 
##  month year type total_volume 
##  
## Proportion of variance explained by model: 47.41%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                     lmg
## month        0.13078681
## year         0.07399211
## type         0.75456058
## total_volume 0.04066050
## 
## Average coefficients for different model sizes: 
## 
##                     1group       2groups       3groups       4groups
## type          5.002927e-01  4.969983e-01  4.939747e-01  4.912087e-01
## month10       2.904960e-01  2.890097e-01  2.886308e-01  2.893588e-01
## month11       1.663762e-01  1.665884e-01  1.679718e-01  1.703928e-01
## month12       4.192540e-02  3.929673e-02  3.725483e-02  3.577504e-02
## month2       -2.957231e-02 -2.802169e-02 -2.698635e-02 -2.653008e-02
## month3        4.177503e-02  4.313927e-02  4.481640e-02  4.667318e-02
## month4        8.519383e-02  8.331178e-02  8.142483e-02  7.950803e-02
## month5        5.741402e-02  6.148067e-02  6.505841e-02  6.804710e-02
## month6        1.197779e-01  1.211728e-01  1.223033e-01  1.231046e-01
## month7        1.728902e-01  1.727509e-01  1.727154e-01  1.727836e-01
## month8        2.233277e-01  2.244754e-01  2.260974e-01  2.280602e-01
## month9        2.834678e-01  2.833520e-01  2.839625e-01  2.852349e-01
## year2016     -3.695078e-02 -3.646617e-02 -3.644299e-02 -3.685972e-02
## year2017      1.395372e-01  1.405557e-01  1.411118e-01  1.411752e-01
## total_volume -2.318360e-08 -1.739063e-08 -1.158561e-08 -5.769023e-09
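
The lmg metric averages, over every possible ordering of the predictors, the gain in R-squared when a predictor enters the model. With rela = TRUE the shares are rescaled to sum to 1, so type's 0.7546 is a share of the 47.41% explained, or about 0.36 of the total variance. The sketch below is a simplified base R version on mtcars for three numeric predictors. It is not the relaimpo implementation and does not group factor dummies; r_squared, permutations and lmg_shares are hypothetical helpers.

```r
# R-squared of a model with the given predictors (0 for the empty model).
r_squared <- function(predictors, response, data) {
  if (length(predictors) == 0) return(0)
  summary(lm(reformulate(predictors, response = response), data = data))$r.squared
}

# All orderings of a character vector (plain recursion; fine for a few predictors).
permutations <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (rest in permutations(x[-i])) {
      out[[length(out) + 1]] <- c(x[i], rest)
    }
  }
  out
}

# Average sequential R-squared gain of each predictor over all orderings.
lmg_shares <- function(predictors, response, data) {
  orderings <- permutations(predictors)
  gains <- setNames(numeric(length(predictors)), predictors)
  for (ordering in orderings) {
    for (k in seq_along(ordering)) {
      before <- r_squared(ordering[seq_len(k - 1)], response, data)
      after  <- r_squared(ordering[seq_len(k)], response, data)
      gains[ordering[k]] <- gains[ordering[k]] + (after - before)
    }
  }
  gains / length(orderings)
}

shares  <- lmg_shares(c("wt", "hp", "disp"), "mpg", mtcars)
full_r2 <- r_squared(c("wt", "hp", "disp"), "mpg", mtcars)

print(shares)                # unnormalised: sums to the full-model R-squared
print(shares / sum(shares))  # normalised, as with rela = TRUE
```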